<?php
/**
 * <https://y.st./>
 * Copyright © 2016 Alex Yst <mailto:copyright@y.st>
 * 
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org./licenses/>.
**/

$xhtml = array(
	'<{title}>' => '<code>\\CURLOPT_NOPROGRESS</code>, <code>\\curl_getinfo()</code>, and <code>\\parse_url()</code>',
	'<{body}>' => <<<END
<p>
	I did a bit of searching, and I found that there is a <a href="https://bugs.php.net/bug.php?id=65236">bug in how <code>\\xml_parse_into_struct()</code> handles deeply-nested tags</a>.
	According to the Red Hat bug report, this bug is capable of <a href="https://bugzilla.redhat.com/show_bug.cgi?id=983689">crashing a script</a>.
	Due to the fact that the file that crashes the spider not being an actual $a[XML] file, I have no idea how it is being interpreted.
	The spider only sends it to that function because it has no way to know if the file is an $a[XML]/$a[HTML] file, so it sends all files through that function.
	After several more hours of debugging though, I finally found a problem in my spider itself that is making this problem worse: the progress function that I set is not being called! I removed a line that set the <code>\\CURLOPT_NOPROGRESS</code> option to false, as I thought that it was leftover code that was not needed.
	The $a[PHP] manual says that <a href="https://secure.php.net/manual/en/function.curl-setopt.php"><code>\\CURLOPT_NOPROGRESS</code> should only be used for debugging</a>, but this is a lie! If <code>\\CURLOPT_NOPROGRESS</code> is not set to false, the progress function will never be called.
	The <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">original instructions for setting up a file size limit</a> were very helpful, as was the fact that my limiting function was set to generate output, though not output was being generated.
	Once I had the problem fixed though, too much output was being generated, ,so I had to disable that feature.
	I set up a new setting to re-enable it in the configuration file though.
	It seems that my file size limit might have been too high though, so I turned it down to eight megabytes.
	I am unsure if this file size is low enough to prevent future problems with non-$a[XML] files or if the problem is only prevented for now because the file that triggered the error this particular time is larger than this limit.
</p>
<p>
	I looked further into the <a href="https://secure.php.net/manual/en/function.curl-getinfo.php"><code>\\curl_getinfo()</code> function</a>, but it seems that it is not suitable for use in the spider.
	If a single $a[cURL] handle is used for multiple file transfers, the output of this function gets a bit muddled.
	Specifically, if this file is called multiple times, any value that the function returned on a previous call will be called again, unless that value is overridden by a new value.
	That means that if the function is not yet able to retrieve information on the current download when called from within my <code>curl_limit</code> class, it will reuse the last-known value, which could be from a different download.
	I theorize that if I modified the <code>curl_limit</code> class to abort selected downloads based on the Content-Type header, files that the spider attempts to retrieve after the aborted download will also be potentially aborted, as <code>curl_limit</code> objects seem to get called several times before the server actually returns data, so the headers are not yet available.
</p>
<p>
	On the topic of optimization of include.d code, I realized today that my <code>CURLOPT_TOR</code> constant has an issue.
	It relies on the \\CURL* constants, which unless the $a[cURL] extension is installed and loaded, do not exist.
	This constant is only useful when the $a[cURL] extension is loaded, but it is defined any time any constant in the name space is needed.
</p>
<p>
	We went out to some dunes to do some cooking on the trail, as Cyrus had to do that today to meet one of his Boy Scout requirements.
	Unfortunately, he forgot to bring both the metal screen to put over the fire and the pot to cook in, so we had to go back home to get them, using up some of what little time we have.
	Upon returning to the dunes, we gathered firewood and sap to build a fire and cook a meal.
	We then rushed back into town so that Cyrus could meet up with one of his counselors and get some of his requirements signed off on.
</p>
<p>
	While he was meeting with the counsellor, Vivian gave me a calendar and newspaper from the volunteer organization that she works with to look over.
	Today was hectic, so I was unable to look them over today, but hopefully they will give me a better idea of what she does.
</p>
<p>
	Upon returning home, Cyrus went over a few quick things with us in order to pass off more requirements.
	We discussed emergency situations and what to do in them, put together an emergency backpack, then discussed the family finances.
	After that, we were off to the local public library to finish our work there.
	We finished pruning the analog electronic media ($a[VHS] cassettes and audio cassettes), then finished moving them all over to a single section to condense them.
	We cleaned off the dust on the freed shelves, reconfigured the shelves to have different spacing, and spread the digital electronic media out to allow room for collection growth.
</p>
<p>
	Once home from the library, Cyrus worked more on his Boy Scout requirements on his own.
	For me, it was time to once again debug the spider.
	It was again choking of a supposedly-empty $a[URI] due to that bug that I intentionally did not fix <a href="/en/weblog/2016/01-January/19.xhtml">the other day</a>.
	This time, however, my code was just fine; I had not gotten a blank $a[URI] due to an error in my code.
	I tracked down the hyperlink that caused the issue, and it was pointed at the $a[URI] &quot;#fn:1&quot;.
	I immediately saw the problem: <code>\\parse_url()</code> was seeing the colon and getting confused.
	To test my theory, I called <code>\\parse_url()</code> directly and passed in the troublesome $a[URI] fragment, and sure enough, the function returned falls.
	The next step was clear.
	I needed to find out whether the $a[URI] fragment is valid.
	If the formatting is valid, I would know that <code>\\parse_url()</code> is broken and I need to rewrite <code>merge_uris()</code> to use some other method or breaking $a[URI]s into their components.
	On the other hand, if that syntax is invalid, this was not a problem in <code>\\parse_url()</code>.
	Instead of rewriting <code>merge_uris()</code>, I needed to simply fix the bug in the spider.
</p>
<p>
	As it turns out, that $a[URI] fragment is <a href="https://tools.ietf.org/html/rfc3986#section-3.5">completely valid</a>.
	I that same document, an example regular expression that parses a $a[URI] is present in the document, though the $a[RFC]s are not under a free license.
	Luckily, someone that prefers to remain unmentioned helped me find a document saying that <a href="http://trustee.ietf.org/license-info/IETF-TLP-5.htm">$a[IETF] document code components are covered by the Modified $a[BSD] license</a>.
	I should be able to construct a working function tomorrow if I have time.
</p>
<p>
	In the end, Cyrus was unable to meet his project deadline of midnight tonight.
	However, it seems that he has been granted a short extension, allowing him to finish up by nine tomorrow.
</p>
<p>
	My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
<section id="docmod">
	<h2>Document modifications</h2>
	<p>
		On <a href="/en/weblog/2018/01-January/16.xhtml#Vivian">2018-01-16</a>, my sister, Vivian, requested that I replace all instances of her legal name in my journal with the name &quot;Vivian&quot;.
		She also asked that the name of the organisation she works for be redacted.
		This page was modified to fulfil that request.
	</p>
</section>
END
);
